Building a Dataset of Multilingual Cognates for the Romanian Lexicon
Abstract
Identifying cognates is an interesting task with applications in numerous research areas, such as historical and comparative linguistics, language acquisition, cross-lingual information retrieval, readability and machine translation. We propose a dictionary-based approach to identifying cognates based on etymology and etymons. We account for relationships between languages and extract etymology-related information from electronic dictionaries. We use the resulting dataset of cognates as a gold standard for evaluating to what extent orthographic methods can detect cognate pairs. The question is whether these methods can discriminate between cognates and non-cognates, given the orthographic changes that foreign words undergo when entering new languages. We investigate several orthographic approaches widely used in this research area, along with some original metrics. We run our experiments on the Romanian lexicon, but the proposed method is adaptable to any language for which the necessary resources are available.
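As an illustration of what an orthographic approach to cognate detection can look like, the following is a minimal sketch of two string-similarity measures that are common in this area: normalized edit distance and the longest common subsequence ratio (LCSR). The abstract does not name the metrics the authors evaluated, so the choice of these two measures, the function names, and the example word pair are assumptions made for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance normalized by the longer word, mapped to [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return lcs_length(a, b) / longest


# Illustrative pair (an assumption, not taken from the dataset):
# Romanian "noapte" and Spanish "noche", both descended from Latin "noctem".
print(edit_similarity("noapte", "noche"), lcsr("noapte", "noche"))
```

A score near 1 indicates orthographically close words. Whether any threshold on such scores separates cognates from non-cognates is the question the abstract raises, and a gold-standard dataset is what makes it testable.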
Related articles
Exploring the Mental Lexicon of the Multilingual: Vocabulary Size, Cognate Recognition and Lexical Access in the L1, L2 and L3
Recent empirical findings in the field of Multilingualism have shown that the mental lexicon of a language learner does not consist of separate entities, but rather of an intertwined system where languages can interact with each other (e.g. Cenoz, 2013; Szubko-Sitarek, 2015). Accordingly, multilingual language learners have been considered differently to second language learners in a growing nu...
Mental Representation of Cognates/Noncognates in Persian-Speaking EFL Learners
The purpose of this study was to investigate the mental representation of cognate and noncognate translation pairs in languages with different scripts to test the prediction of dual lexicon model (Gollan, Forster, & Frost, 1997). Two groups of Persian-speaking English language learners were tested on cognate and noncognate translation pairs in Persian-English and English-Persian directions with...
A Persian-English Cross-Linguistic Dataset for Research on the Visual Processing of Cognates and Noncognates
Finding out which lexico-semantic features of cognates are critical in cross-language studies and comparing these features with noncognates helps researchers to decide which features to control in studies with cognates. Normative databases provide necessary information for this purpose. Such resources are lacking in the Persian language. We created a dataset and determined norms for the essenti...
MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
The paper presents the fourth, “Mondilex” edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development, focused on the morphosyntactic level of linguistic description. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specif...
First Steps in Building a Verb Valency Lexicon for Romanian
This paper presents some steps in manually building a verb valency lexicon for Romanian. We refer to some major previous works by focusing on their information representation. We select that information for different stages of our project and we show the conceptual problems encountered during the first phase. Finally we present the gradually building procedure of the lexicon and we exemplify th...
Publication date: 2014